Abstract

We study a general online convex optimization problem. We have a convex set $S$ and an
unknown sequence of cost functions $c_1, c_2, \ldots$, and in each round we choose a feasible point
$x_t$ in $S$ and learn the cost $c_t(x_t)$. If the function $c_t$ is also revealed after each round then, as
Zinkevich shows in [23], gradient descent can be used on these functions to get regret bounds
of $O(\sqrt{n})$. That is, after $n$ rounds, the total cost incurred will be $O(\sqrt{n})$ more than the cost of
the best single feasible decision chosen with the benefit of hindsight, $\min_{x \in S} \sum_t c_t(x)$.

We extend this to the "bandit" setting where, in each period, only the cost $c_t(x_t)$ is revealed,
and bound the expected regret (against an oblivious adversary) as $O(n^{3/4})$.

Our approach uses a simple approximation of the gradient that is computed from evaluating 
$c_t$ at a single (random) point. We show that this biased estimate is sufficient to approximate
gradient descent on the sequence of functions. In other words, it is possible to use gradient 
descent in the online setting without seeing anything more than the value of the functions at a 
single point. 

For the online linear optimization problem [14], algorithms with low regret in the bandit
setting have recently been given against oblivious and adaptive adversaries [1]. In contrast
to these algorithms, which divide time into explicit explore and exploit phases, our algorithm 
can be interpreted as doing a small amount of exploration in each round. 



1 Introduction 

Consider three optimization settings where one would like to minimize a convex function
(equivalently, maximize a concave function). In all three settings, gradient descent is one of the
most popular methods.

1. Offline: Minimize a fixed convex cost function $c : \mathbb{R}^d \to \mathbb{R}$. In this case, gradient descent is
$x_{t+1} = x_t - \eta \nabla c(x_t)$.

2. Stochastic: Minimize a fixed convex cost function $c$ given only "noisy" access to $c$; for example,
we can only get $c_t(x) = c(x) + \epsilon_t(x)$ for a zero-mean random error $\epsilon_t(x)$. Here, stochastic
gradient descent is $x_{t+1} = x_t - \eta \nabla c_t(x_t)$. (The intuition is that the expected gradient is
correct, i.e. $\mathbb{E}[\nabla c_t(x)] = \nabla \mathbb{E}[c_t(x)] = \nabla c(x)$.) In non-convex cases, the additional randomness
may actually help avoid local minima, in a manner similar to Simulated Annealing [12].

3. Online: Minimize an unknown sequence of convex functions $c_1, c_2, \ldots$, i.e. choose a sequence
$x_1, x_2, \ldots$ where each $x_t$ depends only on $x_1, x_2, \ldots, x_{t-1}$ and $c_1, c_2, \ldots, c_{t-1}$. The goal is to
have low regret $\sum_t c_t(x_t) - \min_{x} \sum_t c_t(x)$ for not using the best single point, chosen with the
benefit of hindsight. In this setting, Zinkevich analyzes the regret of gradient descent given
by $x_{t+1} = x_t - \eta \nabla c_t(x_t)$.

We will focus primarily on gradient descent in a "bandit" version of the online setting. As
a motivating example, consider a company that has to decide, every week, how much to spend
advertising on each of $d$ different channels, represented as a vector $x_t \in \mathbb{R}^d$. At the end of
each week, they calculate their total profit $p_t(x_t)$. In the offline case, one might assume that the
weekly functions $p_1, p_2, \ldots$ are identical. In the stochastic case, one might assume that different
weeks will have different profit functions, but the $p_t(x)$ will be noisy realizations of some true underlying
profit function, for example $p_t(x) = p(x) + \epsilon_t(x)$, where $\epsilon_t(x)$ has mean 0. In the online case,
no assumptions are made about a distribution over convex profit functions and instead they are
modeled as the malicious choices of an (oblivious) adversary. This allows, for example, for the
possibility of a bad economy which causes the profits to crash.

In this paper, we consider the bandit case where we only have black-box access to the function(s)
and thus cannot access the gradient of $c_t$ directly for gradient descent. (In the advertising example,
the advertisers only find out the total profit of their chosen $x_t$, and not how much they would have
profited from other values of $x$.) This type of optimization is sometimes referred to as direct or
gradient-free.

A natural approach in the black-box case, for all three settings, would be to evaluate the
function at several points around the current point and estimate the gradient from those values
(see Finite Difference Stochastic Approximation, e.g. Chapter 6 of [?]). However, in the
online setting, the functions change adversarially over time and we can evaluate each function
only once. We use a one-point estimate of the gradient to sidestep these difficulties.



1.1 A one-point estimate to the gradient 

Our estimate is based on the observation that for a uniformly random unit vector $u$,

\begin{align*}
\nabla f(x) &\approx \mathbb{E}[(f(x + \delta u) - f(x))u]\, d/\delta \tag{1} \\
            &= \mathbb{E}[f(x + \delta u)u]\, d/\delta \tag{2}
\end{align*}

The first line looks more like an approximation of the gradient than the second. But because $u$ is
uniformly random over the sphere, in expectation the second term in the first line is zero, since
$\mathbb{E}[f(x)u] = f(x)\,\mathbb{E}[u] = 0$. Thus, it would seem that on average, the vector $(d/\delta) f(x + \delta u)u$
is an estimate of the gradient with low bias, and thus we say loosely that it is an approximation
to the gradient.

To make this precise, we show in Section 2 that $(d/\delta) f(x + \delta u)u$ is an unbiased estimator of the
gradient of a smoothed version of $f$, where the value of $f$ at $x$ is replaced by the average over a ball
of radius $\delta$ around $x$. For a vector $v$ selected uniformly at random from the unit ball, let

\[ \hat{f}(x) = \mathbb{E}[f(x + \delta v)]. \]

Then

\[ \nabla \hat{f}(x) = \mathbb{E}[f(x + \delta u)u]\, d/\delta. \]

Interestingly, this does not require that $f$ be differentiable.
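As an informal sanity check of this identity, the following Python sketch averages many one-point estimates $(d/\delta) f(x + \delta u)u$ for a linear function, for which the smoothed function coincides with $f$ and the gradient is a constant vector. The helper names (`random_unit_vector`, `one_point_estimate`, `averaged_estimate`), the test function, and the sample count are illustrative assumptions, not part of the paper.

```python
import math
import random

def random_unit_vector(d, rng):
    """Uniform point on the unit sphere: normalize a standard Gaussian vector."""
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(v * v for v in g))
    return [v / norm for v in g]

def one_point_estimate(f, x, delta, rng):
    """Single-sample estimate (d/delta) * f(x + delta*u) * u."""
    d = len(x)
    u = random_unit_vector(d, rng)
    value = f([xi + delta * ui for xi, ui in zip(x, u)])
    return [(d / delta) * value * ui for ui in u]

def averaged_estimate(f, x, delta, samples, seed=0):
    """Average many independent one-point estimates (only to inspect their mean)."""
    rng = random.Random(seed)
    total = [0.0] * len(x)
    for _ in range(samples):
        g = one_point_estimate(f, x, delta, rng)
        total = [t + gi for t, gi in zip(total, g)]
    return [t / samples for t in total]

# Assumed test function: f(x) = a . x, whose gradient is a everywhere.
a = [1.0, -2.0, 0.5]
f = lambda x: sum(ai * xi for ai, xi in zip(a, x))
estimate = averaged_estimate(f, [0.0, 0.0, 0.0], delta=0.1, samples=100_000)
```

The average should land close to `a`, whereas any single sample is a very noisy estimate whose magnitude can be as large as $(d/\delta)\max|f|$; the algorithm analyzed later never averages and relies only on the estimate being correct in expectation.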

Our method of obtaining a one-point estimate of the gradient is similar to the one-point estimates
proposed independently by Granichin [8] and Spall [20]. Spall's estimate uses a perturbation
vector $p$, in which each entry is a zero-mean independent random variable, to produce an estimate
of the gradient

\[ \hat{g}(x) = \frac{f(x + \delta p)}{\delta} \left[ \frac{1}{p_1}, \frac{1}{p_2}, \ldots, \frac{1}{p_d} \right]^T. \]

This estimate is more of a direct attempt to estimate the gradient coordinatewise and is not
rotationally invariant. Spall's analysis focuses on the stochastic setting and requires that the
function be three-times differentiable. In [8], Granichin shows that a similar approximation is
sufficient to perform gradient descent in a very general stochastic model.

Unlike [8, 20], we work in an adversarial model, where instead of trying to make the restrictions
on the randomness of nature as weak as possible, we pessimistically assume that nature is conspiring 
against us. Even in the (oblivious) adversarial setting a one-point estimate of the gradient is 
sufficient to make gradient descent work. 



1.2 Guarantees and analysis outline 

We use the following online bandit version of Zinkevich's model. There is a fixed unknown sequence
of convex functions $c_1, c_2, \ldots, c_n : S \to [-C, C]$, where $C > 0$ and $S \subseteq \mathbb{R}^d$ is a convex feasible set.
The decision-maker sequentially chooses points $x_1, x_2, \ldots, x_n \in S$. After $x_t$ is chosen, the value
$c_t(x_t)$ is revealed, and $x_{t+1}$ must be chosen based only on $x_1, x_2, \ldots, x_t$ and $c_1(x_1), c_2(x_2), \ldots, c_t(x_t)$
(and private randomness).

Zinkevich shows that, when the gradient $\nabla c_t(x_t)$ is given to the decision-maker after each round,
an online gradient descent algorithm guarantees

\[ \text{regret} = \sum_{t=1}^{n} c_t(x_t) - \min_{x \in S} \sum_{t=1}^{n} c_t(x) \le DG\sqrt{n}. \tag{3} \]

Here $D$ is the diameter of the feasible set, and $G$ is an upper bound on the magnitudes of the
gradients.

By elaborating on his technique, we present update rules for computing a sequence of $x_{t+1}$ in
the absence of $\nabla c_t(x_t)$, that give the following guarantee on expected regret:

\[ \mathbb{E}\left[\sum_{t=1}^{n} c_t(x_t)\right] - \min_{x \in S} \sum_{t=1}^{n} c_t(x) \le 6 n^{5/6} dC. \]

Notice we have replaced the differentiability and bounded gradient assumptions by bounded function
assumptions. As expected, our guarantees in the bandit setting are worse than those of the
full-information setting: $O(n^{5/6})$ instead of $O(n^{1/2})$. If we make an additional assumption that
the functions satisfy an $L$-Lipschitz condition (which is less restrictive than a bounded gradient
assumption), then we can reduce expected regret to $O(n^{3/4})$:

\[ \mathbb{E}\left[\sum_{t=1}^{n} c_t(x_t)\right] - \min_{x \in S} \sum_{t=1}^{n} c_t(x) \le 6 n^{3/4} d \left(\sqrt{CLD} + C\right). \]

To prove these bounds, we have several pieces to put together. First of all, we show that Zinkevich's
guarantee (3) holds unmodified for vectors that are unbiased estimates of the gradients. Here $G$
becomes an upper bound on the magnitude of the estimates.

Now, the updates should roughly be of the form $x_{t+1} = x_t - \eta (d/\delta)\, \mathbb{E}[c_t(x_t + \delta u_t)u_t]$. Since we
can only evaluate each function at one point, that point should be $x_t + \delta u_t$. However, our analysis
applies to bound $\sum_t c_t(x_t)$ and not $\sum_t c_t(x_t + \delta u_t)$. Fortunately, these points are close together and
thus these values should not be too different.

Another problem that arises is that the perturbations may move points outside the feasible set.
To deal with these issues, we stay on a subset of the set such that the ball of radius $\delta$ around each
point in the subset is contained in $S$. In order to do this, it is helpful to have bounds on the radii
$r, R$ of balls that are contained in $S$ and that contain $S$, respectively. Then guarantees can be given
in terms of $R/r$. Finally, we can use existing algorithms [17] to reshape the body so that $R/r \le d$ to
get the final results.

1.3 Related work

For direct offline optimization, i.e. from an oracle that evaluates the function, in theory one can
use the ellipsoid [11] or more recent random-walk based approaches [1]. In black-box optimization,
practitioners often use Simulated Annealing [12] or finite difference/simulated perturbation
stochastic approximation methods (see, for example, [?]). In the case that the functions may
change dramatically over time, a single-point approximation to the gradient may be necessary.
Granichin and Spall propose a different single-point estimate of the gradient [8, 20].

In addition to the appeal of an online model of convex optimization, Zinkevich's gradient descent
analysis can be applied to several other online problems for which gradient descent and other
special-purpose algorithms have been carefully analyzed, such as Universal Portfolios [?], online
linear regression [?] and online shortest paths [22] (one convexifies to get an online shortest flow
problem).

A similar line of research has developed for the problem of online linear optimization [14, ?].
Here, one wants to solve the related but incomparable problem of optimizing a sequence of linear
functions, over a possibly non-convex feasible set, modeling problems such as online shortest paths
and online binary search trees (which are difficult to convexify). Kalai and Vempala [14] show that,
for such linear optimization problems in general, if the offline optimization problem is solvable
efficiently, then regret can be bounded by $O(\sqrt{n})$ also by an efficient online algorithm, in the
full-information model. Awerbuch and Kleinberg generalize this to the bandit setting against an
oblivious adversary (like ours). Blum and McMahan [3] give a simpler algorithm that applies to
adaptive adversaries, which may choose their functions $c_t$ depending on the previous points.

A few comparisons are interesting to make with the online linear optimization problem. First 
of all, for the bandit versions of the linear problems, there was a distinction between exploration 
phases and exploitation phases. During exploration phases, one action from a barycentric spanner
basis of $d$ actions was chosen, for the sole purpose of estimating the linear objective function. In
contrast, our algorithm does a little bit of exploration each time. Secondly, Blum and McMahan
[3] were able to compete against an adaptive adversary, using a careful martingale analysis. It is
not clear if that can be done in our setting. 

1.4 Notation 

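Let $\mathbb{B}$ and $\mathbb{S}$ be the unit ball and sphere centered around the origin in $d$ dimensions, respectively,

\[ \mathbb{B} = \{x \in \mathbb{R}^d : |x| \le 1\}, \qquad \mathbb{S} = \{x \in \mathbb{R}^d : |x| = 1\}. \]

The ball and sphere of radius $a$ are $a\mathbb{B}$ and $a\mathbb{S}$, correspondingly. (The sphere $\mathbb{S}$ is distinct from
the feasible set $S$.)

The sequence of functions $c_1, c_2, \ldots, c_n : S \to \mathbb{R}$ is fixed in advance (we only handle such an
oblivious adversary, not an adaptive one). The sequence of points we pick is $x_1, x_2, \ldots, x_n$.
Bandit algorithms need to be randomized, so we consider our expected regret:

\[ \mathbb{E}\left[\sum_{t=1}^{n} c_t(x_t)\right] - \min_{x \in S} \sum_{t=1}^{n} c_t(x). \]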
Zinkevich assumes the existence of a projection oracle $P_S(x)$, projecting the point $x$ onto the
nearest point in the convex set $S$,

\[ P_S(x) = \arg\min_{z \in S} \|x - z\|. \]

Projecting onto the set is an elegant way to handle the situation in which the gradient step takes one
outside of the set, and is a common trick in the optimization literature. Note that computing $P_S$ is
"only" an offline convex optimization problem. While for arbitrary feasible sets this may seem difficult,
for standard shapes, such as cube, ball, simplex, etc., the calculation is quite straightforward.
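To illustrate how simple the projection is for two of the standard shapes mentioned above, here is a minimal Python sketch. The function names `project_onto_ball` and `project_onto_box` are hypothetical helpers introduced for illustration, and both sets are assumed to be centered or aligned as stated in the comments.

```python
import math

def project_onto_ball(x, radius):
    """Nearest point to x in the Euclidean ball of the given radius centered at 0."""
    norm = math.sqrt(sum(v * v for v in x))
    if norm <= radius:
        return list(x)          # already feasible
    return [v * radius / norm for v in x]   # rescale onto the boundary

def project_onto_box(x, low, high):
    """Nearest point to x in the axis-aligned cube [low, high]^d: clip each coordinate."""
    return [min(max(v, low), high) for v in x]

ball_point = project_onto_ball([3.0, 4.0], 1.0)       # lies on the unit circle
box_point = project_onto_box([-2.0, 0.5, 3.0], 0.0, 1.0)
```

For the ball the projection rescales the point; for the cube it clips coordinate by coordinate. The simplex needs a slightly longer routine and is omitted here.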

A function $f$ is $L$-Lipschitz if

\[ |f(x) - f(y)| \le L|x - y| \]

for all $x, y$ in the domain of $f$.

We assume $S$ contains the ball of radius $r$ centered at the origin and is contained in the ball of
radius $R$, i.e.,

\[ r\mathbb{B} \subseteq S \subseteq R\mathbb{B}. \]



2 Approximating the gradient with a single sample 

The main observation of this section is that we can estimate the gradient of a function $f$ by taking
a random unit vector $u$ and scaling it by $f(x + \delta u)$, i.e. $\hat{g} = f(x + \delta u)u$. The approximation is
correct in the sense that $\mathbb{E}[\hat{g}]$ is proportional to the gradient of a smoothed version of $f$. For any
function $f$, for $v$ random from the unit ball, define

\[ \hat{f}(x) = \mathbb{E}_{v \in \mathbb{B}}[f(x + \delta v)]. \tag{4} \]

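Lemma 1. Fix $\delta > 0$. Over random unit vectors $u$,

\[ \mathbb{E}_{u \in \mathbb{S}}[f(x + \delta u)u] = \frac{\delta}{d} \nabla \hat{f}(x). \]

Proof. If $d = 1$, then the fundamental theorem of calculus implies

\[ \frac{d}{dx} \int_{-\delta}^{\delta} f(x + v)\,dv = f(x + \delta) - f(x - \delta). \]

The $d$-dimensional generalization, following from Stokes' theorem, is

\[ \nabla \int_{\delta\mathbb{B}} f(x + v)\,dv = \int_{\delta\mathbb{S}} f(x + u) \frac{u}{\|u\|}\,du. \tag{5} \]

By definition,

\[ \hat{f}(x) = \mathbb{E}[f(x + \delta v)] = \frac{\int_{\delta\mathbb{B}} f(x + v)\,dv}{\mathrm{vol}_d(\delta\mathbb{B})}. \tag{6} \]

Similarly,

\[ \mathbb{E}[f(x + \delta u)u] = \frac{\int_{\delta\mathbb{S}} f(x + u) \frac{u}{\|u\|}\,du}{\mathrm{vol}_{d-1}(\delta\mathbb{S})}. \tag{7} \]

Combining Eqs. (5), (6), and (7), and the fact that the ratio of volume to surface area of a
$d$-dimensional ball of radius $\delta$ is $\delta/d$, gives the lemma. □

Notice that the function $\hat{f}$ is differentiable even when $f$ is not.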
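To see the lemma at work in the simplest case, the following worked example specializes it to $d = 1$, where the unit sphere is $\{-1, +1\}$ and the unit ball is $[-1, 1]$. The test function $f(x) = |x|$ is an assumption chosen here for illustration.

```latex
\begin{align*}
\mathbb{E}_{u}[f(x + \delta u)\,u] &= \tfrac{1}{2}\bigl(f(x + \delta) - f(x - \delta)\bigr), \\
\hat{f}(x) &= \frac{1}{2\delta}\int_{-\delta}^{\delta} f(x + v)\,dv
  \quad\Longrightarrow\quad
  \hat{f}'(x) = \frac{f(x + \delta) - f(x - \delta)}{2\delta}, \\
\text{so}\quad \mathbb{E}_{u}[f(x + \delta u)\,u] &= \delta\,\hat{f}'(x) = \frac{\delta}{d}\,\hat{f}'(x) \quad (d = 1). \\[4pt]
\text{For } f(x) = |x| \text{ and } |x| \le \delta:\quad
\hat{f}(x) &= \frac{x^2 + \delta^2}{2\delta}, \qquad \hat{f}'(x) = \frac{x}{\delta}.
\end{align*}
```

In this example $\hat{f}$ is differentiable at $0$ although $|x|$ is not, in line with the remark that smoothing yields a differentiable function.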

3 Expected Gradient Descent

First we consider a version of gradient descent where at each step $t$ we get a random vector $g_t$ with
expectation equal to the gradient. Then we can still use Zinkevich's online analysis of gradient
descent. For lack of a better choice, we use the starting point $x_1 = 0$, the center of a containing
ball of radius $R \le D$, and $x_{t+1} = P_S(x_t - \eta g_t)$.

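Lemma 2. Let $c_1, c_2, \ldots, c_n : S \to \mathbb{R}$ be a sequence of convex, differentiable functions. Let
$x_1, x_2, \ldots, x_n \in S$ be defined by $x_1 = 0$ and $x_{t+1} = P_S(x_t - \eta g_t)$, where $\eta > 0$ and $g_1, \ldots, g_n$
are vector-valued random variables with $\mathbb{E}[g_t \mid x_t] = \nabla c_t(x_t)$ and $\|g_t\| \le G$, for some $G > 0$ (this
also implies $\|\nabla c_t(x)\| \le G$). Then, for $\eta = \frac{R}{G\sqrt{n}}$,

\[ \mathbb{E}\left[\sum_{t=1}^{n} c_t(x_t)\right] - \min_{x \in S} \sum_{t=1}^{n} c_t(x) \le RG\sqrt{n}. \]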

Proof. Let $x_*$ be a point in $S$ minimizing $\sum_{t=1}^{n} c_t(x)$.

Since $c_t$ is convex and differentiable, we can bound the difference between $c_t(x_t)$ and $c_t(x_*)$ in
terms of the gradient:

\begin{align*}
c_t(x_t) - c_t(x_*) &\le \nabla c_t(x_t) \cdot (x_t - x_*) \\
                    &= \mathbb{E}[g_t \mid x_t] \cdot (x_t - x_*) \\
                    &= \mathbb{E}[g_t \cdot (x_t - x_*) \mid x_t].
\end{align*}

Taking the expectation on both sides of this inequality yields

\[ \mathbb{E}[c_t(x_t) - c_t(x_*)] \le \mathbb{E}[g_t \cdot (x_t - x_*)]. \tag{8} \]

Following Zinkevich's analysis, we use $\|x_t - x_*\|^2$ as a potential function. Since $S$ is convex, for any
$x$ we have $\|P_S(x) - x_*\| \le \|x - x_*\|$. So

\begin{align*}
\|x_{t+1} - x_*\|^2 &= \|P_S(x_t - \eta g_t) - x_*\|^2 \\
                    &\le \|x_t - \eta g_t - x_*\|^2 \\
                    &= \|x_t - x_*\|^2 + \eta^2 \|g_t\|^2 - 2\eta (x_t - x_*) \cdot g_t \\
                    &\le \|x_t - x_*\|^2 + \eta^2 G^2 - 2\eta (x_t - x_*) \cdot g_t.
\end{align*}

After rearranging terms, we have

\[ g_t \cdot (x_t - x_*) \le \frac{\|x_t - x_*\|^2 - \|x_{t+1} - x_*\|^2 + \eta^2 G^2}{2\eta}. \tag{9} \]
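
By putting Eq. (8) and Eq. (9) together we see that

\begin{align*}
\mathbb{E}\left[\sum_{t=1}^{n} c_t(x_t)\right] - \sum_{t=1}^{n} c_t(x_*)
  &= \sum_{t=1}^{n} \mathbb{E}[c_t(x_t) - c_t(x_*)] \\
  &\le \sum_{t=1}^{n} \mathbb{E}[g_t \cdot (x_t - x_*)] \\
  &\le \sum_{t=1}^{n} \mathbb{E}\left[\frac{\|x_t - x_*\|^2 - \|x_{t+1} - x_*\|^2 + \eta^2 G^2}{2\eta}\right] \\
  &= \mathbb{E}\left[\frac{\|x_1 - x_*\|^2 - \|x_{n+1} - x_*\|^2}{2\eta}\right] + n\frac{\eta^2 G^2}{2\eta} \\
  &\le \frac{R^2}{2\eta} + n\frac{\eta G^2}{2}.
\end{align*}

The last step follows because we chose $x_1 = 0$ and $S \subseteq R\mathbb{B}$. Plugging in $\eta = R/(G\sqrt{n})$ gives the
lemma. □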
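The update analyzed in this lemma can be sketched in a few lines of Python. The example below is only an illustration under stated assumptions: $S$ is taken to be the ball of radius $R$, the costs are linear, $c_t(x) = a_t \cdot x$ with $\|a_t\| \le G$, and the exact gradient $g_t = a_t$ is used (a degenerate, zero-variance case of the random vectors in the lemma). The names `project_onto_ball` and `online_gradient_descent` are hypothetical.

```python
import math
import random

def project_onto_ball(x, radius):
    """Projection P_S for S = ball of the given radius centered at the origin."""
    norm = math.sqrt(sum(v * v for v in x))
    if norm <= radius:
        return list(x)
    return [v * radius / norm for v in x]

def online_gradient_descent(gradients, radius, eta):
    """Play x_1 = 0, then x_{t+1} = P_S(x_t - eta * g_t); return the points played.

    For linear costs the gradient does not depend on x_t, so the list of
    gradients can be prepared ahead of time in this sketch.
    """
    x = [0.0] * len(gradients[0])
    plays = []
    for g in gradients:
        plays.append(x)
        x = project_onto_ball([xi - eta * gi for xi, gi in zip(x, g)], radius)
    return plays

rng = random.Random(1)
n, d, R, G = 400, 3, 1.0, 1.0
# Each a_t has coordinates in [-1/sqrt(d), 1/sqrt(d)], hence norm at most G = 1.
a = [[rng.uniform(-1.0, 1.0) / math.sqrt(d) for _ in range(d)] for _ in range(n)]
plays = online_gradient_descent(a, R, eta=R / (G * math.sqrt(n)))

total_cost = sum(sum(ai * xi for ai, xi in zip(a_t, x_t)) for a_t, x_t in zip(a, plays))
# For linear costs on the ball, the best fixed point is -R * s / |s| with s = sum_t a_t.
s = [sum(a_t[i] for a_t in a) for i in range(d)]
best_fixed_cost = -R * math.sqrt(sum(v * v for v in s))
regret = total_cost - best_fixed_cost
```

With exact gradients the potential-function argument holds on every run, so `regret` stays below $RG\sqrt{n}$; with noisy unbiased estimates the same bound holds only in expectation.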

3.1 Algorithm and analysis 

In this section, we analyze the algorithm given in Figure 1.

BGD($\alpha, \delta, \nu$)

• $y_1 = 0$

• At each period $t$:

- select unit vector $u_t$ uniformly at random

- $x_t := y_t + \delta u_t$

- $y_{t+1} := P_{(1-\alpha)S}(y_t - \nu c_t(x_t) u_t)$

Figure 1: Bandit gradient descent algorithm
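The steps of Figure 1 can be sketched in Python as follows. This is a minimal illustration rather than a reference implementation: it assumes $S$ is the ball of radius $R$ (so that $(1-\alpha)S$ is the ball of radius $(1-\alpha)R$ and $r = R$), the function and variable names are hypothetical, and the quadratic cost and the parameter values in the example run are arbitrary choices.

```python
import math
import random

def random_unit_vector(d, rng):
    """Uniform point on the unit sphere: normalize a standard Gaussian vector."""
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(v * v for v in g))
    return [v / norm for v in g]

def project_onto_ball(x, radius):
    norm = math.sqrt(sum(v * v for v in x))
    if norm <= radius:
        return list(x)
    return [v * radius / norm for v in x]

def bandit_gradient_descent(costs, d, R, alpha, delta, nu, seed=0):
    """BGD(alpha, delta, nu) for S = ball of radius R; costs[t] returns only c_t(x)."""
    rng = random.Random(seed)
    y = [0.0] * d                                        # y_1 = 0
    played = []
    for c_t in costs:
        u = random_unit_vector(d, rng)                   # u_t uniform on the sphere
        x = [yi + delta * ui for yi, ui in zip(y, u)]    # x_t = y_t + delta * u_t
        played.append(x)
        value = c_t(x)                                   # bandit feedback: one number
        step = [yi - nu * value * ui for yi, ui in zip(y, u)]
        y = project_onto_ball(step, (1.0 - alpha) * R)   # P_{(1-alpha)S}
    return played

# Illustrative run: unit ball (R = r = 1), squared distance to a fixed target,
# so |c_t| <= C = 4 on S. delta <= alpha * r keeps every x_t feasible.
n, d, R, C = 2000, 2, 1.0, 4.0
target = [0.5, -0.3]
cost = lambda x: sum((xi - ti) ** 2 for xi, ti in zip(x, target))
played = bandit_gradient_descent([cost] * n, d, R, alpha=0.2, delta=0.2,
                                 nu=R / (C * math.sqrt(n)))
max_norm = max(math.sqrt(sum(v * v for v in x)) for x in played)
```

Note that the algorithm never sees a gradient: each round uses only the scalar `value`. The choice `delta <= alpha * r` is what guarantees that the perturbed points stay inside the feasible set.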

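We begin with a few observations.

Observation 1. The optimum in $(1-\alpha)S$ is near the optimum in $S$,

\[ \min_{x \in (1-\alpha)S} \sum_{t=1}^{n} c_t(x) \le 2\alpha Cn + \min_{x \in S} \sum_{t=1}^{n} c_t(x). \]

Proof. Clearly $(1-\alpha)S \subseteq S$. Also,

\[ \min_{x \in (1-\alpha)S} \sum_{t=1}^{n} c_t(x) = \min_{x \in S} \sum_{t=1}^{n} c_t((1-\alpha)x). \]

And since each $c_t$ is convex and $0 \in S$, we have

\begin{align*}
\min_{x \in S} \sum_{t=1}^{n} c_t((1-\alpha)x) &\le \min_{x \in S} \sum_{t=1}^{n} \alpha c_t(0) + (1-\alpha) c_t(x) \\
  &= \min_{x \in S} \sum_{t=1}^{n} \alpha (c_t(0) - c_t(x)) + c_t(x).
\end{align*}

Finally, since for any $y \in S$ and $t \in \{1, \ldots, n\}$ we have $|c_t(y)| \le C$, we may conclude that

\[ \min_{x \in S} \sum_{t=1}^{n} \alpha (c_t(0) - c_t(x)) + c_t(x) \le \min_{x \in S} \sum_{t=1}^{n} \alpha 2C + c_t(x). \]

□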

Observation 2. For any point $x$ in $(1-\alpha)S$, the ball of radius $\alpha r$ centered at $x$ is contained in $S$.

Proof. Since $r\mathbb{B} \subseteq S$ and $S$ is convex, we have

\[ (1-\alpha)S + \alpha r\mathbb{B} \subseteq (1-\alpha)S + \alpha S = S. \]

□

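The next observation establishes a bound on the maximum the function can change in $(1-\alpha)S$,
an effective Lipschitz condition.

Observation 3. For any $x$ in $(1-\alpha)S$ and any $y$ in $S$,

\[ |c_t(x) - c_t(y)| \le \frac{2C}{\alpha r} |x - y|. \]

Proof. Let $y = x + \Delta$. If $|\Delta| > \alpha r$, the observation follows from $|c_t| \le C$. Otherwise, let
$z = x + \alpha r \frac{\Delta}{|\Delta|}$, the point at distance $\alpha r$ from $x$ in the direction $\Delta$. By the previous
observation, we know $z \in S$. Also, $y = \frac{|\Delta|}{\alpha r} z + \left(1 - \frac{|\Delta|}{\alpha r}\right) x$, so

\begin{align*}
c_t(y) &\le \frac{|\Delta|}{\alpha r} c_t(z) + \left(1 - \frac{|\Delta|}{\alpha r}\right) c_t(x) \\
       &= c_t(x) + \frac{c_t(z) - c_t(x)}{\alpha r} |\Delta| \\
       &\le c_t(x) + \frac{2C}{\alpha r} |\Delta|.
\end{align*}

□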

Now we are ready to select the parameters. 



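Theorem 1. For any $n > \left(\frac{3Rd}{2r}\right)^2$ and $\nu = \frac{R}{C\sqrt{n}}$, $\delta = \sqrt[3]{\frac{rR^2d^2}{12n}}$, and $\alpha = \sqrt[3]{\frac{3Rd}{2r\sqrt{n}}}$, the expected regret
of BGD($\alpha, \delta, \nu$) is upper bounded by

\[ \mathbb{E}\left[\sum_{t=1}^{n} c_t(x_t)\right] - \min_{x \in S} \sum_{t=1}^{n} c_t(x) \le 3Cn^{5/6} \sqrt[3]{dR/r}. \]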



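Proof. We begin by showing that the points $x_t \in S$. Since $y_t \in (1-\alpha)S$, Observation 2 implies
this fact as long as $\frac{\delta}{r} \le \alpha < 1$, which is the case for $n > (3Rd/2r)^2$.

Suppose we wanted to run the gradient descent algorithm on the functions $\hat{c}_t$ defined by (4),
and the set $(1-\alpha)S$. If we let

\[ g_t = \frac{d}{\delta} c_t(x_t + \delta u_t) u_t \]

then (since $u_t$ is selected uniformly at random from $\mathbb{S}$) Lemma 1 says $\mathbb{E}[g_t \mid x_t] = \nabla \hat{c}_t(x_t)$. So
Lemma 2 applies with the update rule

\[ x_{t+1} = P_{(1-\alpha)S}(x_t - \eta g_t) = P_{(1-\alpha)S}\left(x_t - \eta \frac{d}{\delta} c_t(x_t + \delta u_t) u_t\right), \]

which is exactly the update rule we are using to obtain $y_t$ (the iterate called $x_t$ in Lemma 2 is
$y_t$ in Figure 1), with $\eta = \nu\delta/d$. Since

\[ \|g_t\| = \left\| \frac{d}{\delta} c_t(x_t + \delta u_t) u_t \right\| \le dC/\delta, \]

we can apply Lemma 2 with $G = dC/\delta$. By our choice of $\nu$, we have $\eta = R/(G\sqrt{n})$, and so the
expected regret is upper bounded by

\[ \mathbb{E}\left[\sum_{t=1}^{n} \hat{c}_t(y_t)\right] - \min_{x \in (1-\alpha)S} \sum_{t=1}^{n} \hat{c}_t(x) \le RG\sqrt{n} = \frac{RdC\sqrt{n}}{\delta}. \]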

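Let $L = \frac{2C}{\alpha r}$, which will act as an "effective Lipschitz constant". Notice that for $x \in (1-\alpha)S$,
Observation 3 shows that $|\hat{c}_t(x) - c_t(x)| \le \delta L$ since $\hat{c}_t$ is an average over inputs within $\delta$ of $x$.
Since $|y_t - x_t| = \delta$, Observation 3 also shows that

\[ |\hat{c}_t(y_t) - c_t(x_t)| \le |\hat{c}_t(y_t) - c_t(y_t)| + |c_t(y_t) - c_t(x_t)| \le 2\delta L. \]

These with the above imply

\[ \mathbb{E}\left[\sum_{t=1}^{n} (c_t(x_t) - 2\delta L)\right] - \min_{x \in (1-\alpha)S} \sum_{t=1}^{n} (c_t(x) + \delta L) \le \frac{RdC\sqrt{n}}{\delta}, \]

so rearranging terms and using Observation 1 gives

\[ \mathbb{E}\left[\sum_{t=1}^{n} c_t(x_t)\right] - \min_{x \in S} \sum_{t=1}^{n} c_t(x) \le \frac{RdC\sqrt{n}}{\delta} + 3\delta Ln + 2\alpha Cn. \tag{10} \]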



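Plugging in $L = \frac{2C}{\alpha r}$ gives

\[ \mathbb{E}\left[\sum_{t=1}^{n} c_t(x_t)\right] - \min_{x \in S} \sum_{t=1}^{n} c_t(x) \le \frac{RdC\sqrt{n}}{\delta} + \frac{\delta}{\alpha} \frac{6Cn}{r} + \alpha 2Cn. \]

This expression is of the form $\frac{a}{\delta} + b\frac{\delta}{\alpha} + c\alpha$. Setting $\delta = \sqrt[3]{\frac{a^2}{bc}}$ and $\alpha = \sqrt[3]{\frac{ab}{c^2}}$
gives a value of $3\sqrt[3]{abc}$.

The theorem is achieved for $a = RdC\sqrt{n}$, $b = 6Cn/r$ and $c = 2Cn$. □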
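The balancing step above is easy to check numerically. The following Python sketch computes $a$, $b$, $c$ from the problem constants, applies the stated choices of $\delta$ and $\alpha$, and evaluates the three terms. The function name `theorem1_parameters` and the sample values of $n, d, R, r, C$ are assumptions made for illustration only.

```python
def theorem1_parameters(n, d, R, r, C):
    """Balance a/delta + b*delta/alpha + c*alpha as in the proof of Theorem 1."""
    a = R * d * C * n ** 0.5
    b = 6.0 * C * n / r
    c = 2.0 * C * n
    delta = (a * a / (b * c)) ** (1.0 / 3.0)
    alpha = (a * b / (c * c)) ** (1.0 / 3.0)
    nu = R / (C * n ** 0.5)
    # The three terms of the bound; with the choices above they are all equal.
    terms = (a / delta, b * delta / alpha, c * alpha)
    return nu, delta, alpha, terms, (a, b, c)

n, d, R, r, C = 10_000, 2, 2.0, 1.0, 1.0      # satisfies n > (3Rd/(2r))^2 = 36
nu, delta, alpha, terms, (a, b, c) = theorem1_parameters(n, d, R, r, C)
balanced_value = 3.0 * (a * b * c) ** (1.0 / 3.0)

# Closed forms from the theorem statement, for comparison.
delta_closed = (r * R * R * d * d / (12.0 * n)) ** (1.0 / 3.0)
alpha_closed = (3.0 * R * d / (2.0 * r * n ** 0.5)) ** (1.0 / 3.0)
```

Each of the three terms equals $\sqrt[3]{abc}$, which is why the sum is $3\sqrt[3]{abc}$, and the generic choices reduce to the closed forms for $\delta$ and $\alpha$ in the theorem statement. For these sample values the feasibility condition $\delta/r \le \alpha < 1$ also holds.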
Theorem 2. If each $c_t$ is $L$-Lipschitz, then for $n$ sufficiently large and $\nu = \frac{R}{C\sqrt{n}}$, $\alpha = \frac{\delta}{r}$, and
$\delta = n^{-.25} \sqrt{\frac{RdCr}{3(Lr + C)}}$,

\[ \mathbb{E}\left[\sum_{t=1}^{n} c_t(x_t)\right] - \min_{x \in S} \sum_{t=1}^{n} c_t(x) \le 2n^{3/4} \sqrt{3RdC(L + C/r)}. \]

Proof. The proof is quite similar to the proof of Theorem 1. Again we check that the points $x_t \in S$,
which holds for $n$ sufficiently large. We now have a direct Lipschitz constant, so we can use it
directly in Eq. (10). Plugging this in with the chosen values of $\alpha$ and $\delta$ gives the theorem. □
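The substitution in this proof can be checked numerically. The sketch below evaluates the right-hand side of Eq. (10) with the Lipschitz constant $L$ used directly and the parameter choices of Theorem 2, and compares it to the stated bound. The function name `theorem2_parameters` and the sample constants are illustrative assumptions.

```python
import math

def theorem2_parameters(n, d, R, r, C, L):
    """Parameter choices of Theorem 2 and the resulting value of Eq. (10)."""
    delta = n ** -0.25 * math.sqrt(R * d * C * r / (3.0 * (L * r + C)))
    alpha = delta / r
    nu = R / (C * math.sqrt(n))
    # Right-hand side of Eq. (10): R d C sqrt(n)/delta + 3 delta L n + 2 alpha C n.
    rhs = (R * d * C * math.sqrt(n) / delta
           + 3.0 * delta * L * n
           + 2.0 * alpha * C * n)
    # Bound claimed in the theorem statement.
    bound = 2.0 * n ** 0.75 * math.sqrt(3.0 * R * d * C * (L + C / r))
    return nu, delta, alpha, rhs, bound

nu, delta, alpha, rhs, bound = theorem2_parameters(
    n=10_000, d=2, R=2.0, r=1.0, C=1.0, L=1.0)
```

The slack between `rhs` and `bound` comes from replacing $2\alpha Cn$ by the slightly larger $3\alpha Cn$, which makes the two $\delta$-dependent terms combine into $3n\delta(L + C/r)$ and leads to the clean closed form.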

3.2 Reshaping

The above regret bound depends on $R/r$, which can be very large. To remove this dependence (or
at least the dependence on $1/r$), we can reshape the body to make it more "round."

The set $S$, with $r\mathbb{B} \subseteq S \subseteq R\mathbb{B}$, can be put in isotropic position [?]. Essentially, this amounts to
estimating the covariance of random samples from the body and applying an affine transformation
$T$ so that the new covariance matrix is the identity matrix.

A body $T(S) \subseteq \mathbb{R}^d$ in isotropic position has several nice properties, including $\mathbb{B} \subseteq T(S) \subseteq d\mathbb{B}$.
So, we first apply the preprocessing step to find $T$, which puts the body in isotropic position. This
gives us a new $R' = d$ and $r' = 1$. The following observation shows that we can use $L' = LR$.

Observation 4. Let $c'_t(u) = c_t(T^{-1}(u))$. Then $c'_t$ is $LR$-Lipschitz.

Proof. Let $x_1, x_2 \in S$ and $u_1 = T(x_1)$, $u_2 = T(x_2)$. Observe that

\[ |c'_t(u_1) - c'_t(u_2)| = |c_t(x_1) - c_t(x_2)| \le L\|x_1 - x_2\|. \]

To make this an $LR$-Lipschitz condition on $c'_t$, it suffices to show that $\|x_1 - x_2\| \le R\|u_1 - u_2\|$.

Suppose not, i.e. $\|x_1 - x_2\| > R\|u_1 - u_2\|$. Define $v_1 = \frac{u_1 - u_2}{\|u_1 - u_2\|}$ and $v_2 = -v_1$. Observe that
$\|v_2 - v_1\| = 2$, and since $T(S)$ contains the ball of radius 1, $v_1, v_2 \in T(S)$. Thus, $y_1 = T^{-1}(v_1)$ and
$y_2 = T^{-1}(v_2)$ are in $S$. Then, since $T$ is affine,

\begin{align*}
\|y_1 - y_2\| &= \frac{1}{\|u_1 - u_2\|} \|T^{-1}(u_1 - u_2) - T^{-1}(u_2 - u_1)\| \\
              &= \frac{2}{\|u_1 - u_2\|} \|T^{-1}(u_1) - T^{-1}(u_2)\| \\
              &= \frac{2}{\|u_1 - u_2\|} \|x_1 - x_2\| > 2R,
\end{align*}

where the last line uses the assumption $\|x_1 - x_2\| > R\|u_1 - u_2\|$. The inequality $\|y_1 - y_2\| > 2R$
contradicts the assumption that $S$ is contained in a sphere of radius $R$. □

Many common shapes such as balls, cubes, etc., are already nicely shaped, but there exist
MCMC algorithms for putting any body into isotropic position from a membership oracle [16, 17].
(Note that the projection oracle we assume is a stronger oracle than a membership oracle.) The
latest (and greatest) algorithm for putting a body into isotropic position, due to Lovász and Vempala
[17], runs in time $O(d^4)\,\text{poly-log}(d, \frac{R}{r})$. This algorithm puts the body into nearly isotropic position,
which means that $\mathbb{B} \subseteq T(S) \subseteq 1.01d\mathbb{B}$. After such preprocessing we would have $r' = 1$, $R' = 1.01d$,
$L' = LR$, and $C' = C$. This gives:

Corollary 1. For a set $S$ of diameter $D$, and $c_t$ $L$-Lipschitz, after putting $S$ into near-isotropic
position, the BGD algorithm has expected regret

\[ \mathbb{E}\left[\sum_{t=1}^{n} c_t(x_t)\right] - \min_{x \in S} \sum_{t=1}^{n} c_t(x) \le 6n^{3/4} d \left(\sqrt{CLR} + C\right). \]

Without the $L$-Lipschitz condition,

\[ \mathbb{E}\left[\sum_{t=1}^{n} c_t(x_t)\right] - \min_{x \in S} \sum_{t=1}^{n} c_t(x) \le 6n^{5/6} dC. \]

Proof. Using $r' = 1$, $R' = 1.01d$, $L' = LR$, and $C' = C$: in the first case, we get an expected regret
of at most $2n^{3/4}\sqrt{6(1.01d)dC(LR + C)}$. In the second case, we get an expected regret of at most
$3Cn^{5/6}\sqrt[3]{2(1.01d)d}$. □

3.3 Conclusions

We have given algorithms for bandit online optimization of convex functions. Our approach is
to extend Zinkevich's gradient descent analysis to a situation where we do not have access to the
gradient. We give a simple trick for approximating the gradient of a function by a single sample, and
we give a simple understanding of this approximation as being the gradient of a smoothed function.
This is similar to an approximation proposed in [20]. The simplicity of our approximation
makes it straightforward to analyze this algorithm in an online setting, with few assumptions.

Zinkevich presents a few nice variations on the model and algorithms. He shows that an adaptive 
step size $\eta_t = O(1/\sqrt{t})$ can be used with similar guarantees. It is likely that a similar adaptive step
size could be used here. 

He also proves that gradient descent can be compared, to an extent, with a non-stationary
adversary. He shows that relative to any sequence $z_1, z_2, \ldots, z_n$, it achieves

\[ \mathbb{E}\left[\sum_{t=1}^{n} c_t(x_t)\right] - \sum_{t=1}^{n} c_t(z_t) \le O\left(GD\sqrt{n\left(1 + \sum_{t} \|z_t - z_{t-1}\|\right)}\right). \]

Thus, compared to an adversary that moves a total distance $o(n)$, he has regret $o(n)$. These types
of guarantees may be extended to the bandit setting.

It would also be interesting to analyze the algorithm in an unconstrained setting, where issues
of the shape of the convex set wouldn't come into play. The difficulty is that in the unconstrained
setting we cannot assume the convex functions are bounded. However, since $\mathbb{E}[c_t(x_t + \delta u_t)u_t] =
\mathbb{E}[(c_t(x_t + \delta u_t) - c_{t-1}(x_{t-1} + \delta u_{t-1}))u_t]$, if the functions do not change too much from period to
period, one may be able to use the evaluation of the previous period as a baseline to prevent the
random gradient estimate from being too large.
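The identity used in this paragraph can be spelled out as follows. This is a short derivation under the assumption that the baseline $b_t$ is determined before $u_t$ is drawn; the symbol $b_t$ is introduced here only for illustration.

```latex
\begin{align*}
\mathbb{E}\bigl[(c_t(x_t + \delta u_t) - b_t)\,u_t\bigr]
  &= \mathbb{E}\bigl[c_t(x_t + \delta u_t)\,u_t\bigr] - \mathbb{E}[b_t\,u_t] \\
  &= \mathbb{E}\bigl[c_t(x_t + \delta u_t)\,u_t\bigr] - \mathbb{E}\bigl[b_t\,\mathbb{E}[u_t \mid b_t]\bigr] \\
  &= \mathbb{E}\bigl[c_t(x_t + \delta u_t)\,u_t\bigr],
  \qquad\text{since } \mathbb{E}[u_t \mid b_t] = 0 \text{ for } u_t \text{ uniform on } \mathbb{S}.
\end{align*}
```

Taking $b_t = c_{t-1}(x_{t-1} + \delta u_{t-1})$ recovers the expression in the text: subtracting it leaves the mean unchanged while shrinking the estimate when consecutive costs are close.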

Acknowledgements. We would like to thank David McAllester and Rakesh Vohra for helpful 
discussions. We are particularly grateful to Rakesh Vohra for pointing us to the work of James 
Spall. 